GEM-PRO - SBML Model

This notebook gives an example of how to run the GEM-PRO pipeline with a SBML model, in this case iNJ661, the metabolic model of M. tuberculosis.

**Input:** GEM (in SBML, JSON, or MAT formats)

**Output:** GEM-PRO model

Imports



In [1]:

    
import sys
import logging



In [2]:

    
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO



In [3]:

    
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Logging

Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. Debug is most verbose.

CRITICAL
- Only really important messages shown
ERROR
- Major errors
WARNING
- Warnings that don't affect running of the pipeline
INFO (default)
- Info such as the number of structures mapped per gene
DEBUG
- Really detailed information that will print out a lot of stuff

**Warning:** `DEBUG` mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!



In [4]:

    
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #



In [5]:

    
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]

Initialization of the project

Set these three things:

ROOT_DIR
- The directory where a folder named after your PROJECT will be created
PROJECT
- Your project name
LIST_OF_GENES
- Your list of gene IDs

A directory will be created in ROOT_DIR with your PROJECT name. The folders are organized like so:

    ROOT_DIR
    └── PROJECT
        ├── data  # General storage for pipeline outputs
        ├── model  # SBML and GEM-PRO models are stored here
        ├── genes  # Per gene information
        │   ├── <gene_id1>  # Specific gene directory
        │   │   └── protein
        │   │       ├── sequences  # Protein sequence files, alignments, etc.
        │   │       └── structures  # Protein structure files, calculations, etc.
        │   └── <gene_id2>
        │       └── protein
        │           ├── sequences
        │           └── structures
        ├── reactions  # Per reaction information
        │   └── <reaction_id1>  # Specific reaction directory
        │       └── complex
        │           └── structures  # Protein complex files
        └── metabolites  # Per metabolite information
            └── <metabolite_id1>  # Specific metabolite directory
                └── chemical
                    └── structures  # Metabolite 2D and 3D structure files

**Note:** Methods for protein complexes and metabolites are still in development.



In [6]:

    
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROJECT = 'mtuberculosis_gp'
GEM_FILE = '../../ssbio/test/test_files/models/iNJ661.json'
GEM_FILE_TYPE = 'json'
PDB_FILE_TYPE = 'mmtf'



In [7]:

    
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, gem_file_path=GEM_FILE, gem_file_type=GEM_FILE_TYPE, pdb_file_type=PDB_FILE_TYPE)









    



[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: Creating GEM-PRO project directory in folder /tmp
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: /tmp/mtuberculosis_gp: GEM-PRO project location
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: iNJ661: loaded model
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: 1025: number of reactions
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: 720: number of reactions linked to a gene
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: 661: number of genes (excluding spontaneous)
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: 826: number of metabolites
[2018-02-05 18:13] [ssbio.pipeline.gempro] WARNING: IMPORTANT: All Gene objects have been transformed into GenePro objects, and will be for any new ones
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: 661: number of genes

Mapping gene ID --> sequence

First, we need to map these IDs to their protein sequences. There are 2 ID mapping services provided to do this - through KEGG or UniProt. The end goal is to map a UniProt ID to each ID, since there is a comprehensive mapping (and some useful APIs) between UniProt and the PDB.

**Note:** You only need to map gene IDs using one service. However you can run both if some genes don't map in one service and do map in another!

However, you don't need to map using these services if you already have the amino acid sequences for each protein. You can just manually load in the sequences as shown using the method manual_seq_mapping. Or, if you already have the UniProt IDs, you can load those in using the method manual_uniprot_mapping.

Methods



In [8]:

    
gene_to_seq_dict = {'Rv1295': 'MTVPPTATHQPWPGVIAAYRDRLPVGDDWTPVTLLEGGTPLIAATNLSKQTGCTIHLKVEGLNPTGSFKDRGMTMAVTDALAHGQRAVLCASTGNTSASAAAYAARAGITCAVLIPQGKIAMGKLAQAVMHGAKIIQIDGNFDDCLELARKMAADFPTISLVNSVNPVRIEGQKTAAFEIVDVLGTAPDVHALPVGNAGNITAYWKGYTEYHQLGLIDKLPRMLGTQAAGAAPLVLGEPVSHPETIATAIRIGSPASWTSAVEAQQQSKGRFLAASDEEILAAYHLVARVEGVFVEPASAASIAGLLKAIDDGWVARGSTVVCTVTGNGLKDPDTALKDMPSVSPVPVDPVAVVEKLGLA',
                    'Rv2233': 'VSSPRERRPASQAPRLSRRPPAHQTSRSSPDTTAPTGSGLSNRFVNDNGIVTDTTASGTNCPPPPRAAARRASSPGESPQLVIFDLDGTLTDSARGIVSSFRHALNHIGAPVPEGDLATHIVGPPMHETLRAMGLGESAEEAIVAYRADYSARGWAMNSLFDGIGPLLADLRTAGVRLAVATSKAEPTARRILRHFGIEQHFEVIAGASTDGSRGSKVDVLAHALAQLRPLPERLVMVGDRSHDVDGAAAHGIDTVVVGWGYGRADFIDKTSTTVVTHAATIDELREALGV'}
my_gempro.manual_seq_mapping(gene_to_seq_dict)









    



[2018-02-05 18:14] [ssbio.pipeline.gempro] INFO: Loaded in 2 sequences



In [9]:

    
manual_uniprot_dict = {'Rv1755c': 'P9WIA9', 'Rv2321c': 'P71891', 'Rv0619': 'Q79FY3', 'Rv0618': 'Q79FY4', 'Rv2322c': 'P71890'}
my_gempro.manual_uniprot_mapping(manual_uniprot_dict)
my_gempro.df_uniprot_metadata.tail(4)









    





 
 










    









    



[2018-02-05 18:14] [ssbio.pipeline.gempro] INFO: Completed manual ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.






    Out[9]:







  
    
      
      uniprot
      reviewed
      gene_name
      kegg
      refseq
      pfam
      description
      entry_date
      entry_version
      seq_date
      seq_version
      sequence_file
      metadata_file
    
    
      gene
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Rv0619
      Q79FY3
      False
      galTb
      NaN
      NaN
      PF02744
      Probable galactose-1-phosphate uridylyltransfe...
      2017-07-05
      78
      2004-07-05
      1
      Q79FY3.fasta
      Q79FY3.xml
    
    
      Rv1755c
      P9WIA9
      False
      plcD
      NaN
      NaN
      PF04185
      Phospholipase C 4
      2017-07-05
      18
      2014-04-16
      1
      P9WIA9.fasta
      P9WIA9.xml
    
    
      Rv2321c
      P71891
      False
      rocD2
      mtv:RVBD_2321c
      WP_003411956.1
      PF00202
      Probable ornithine aminotransferase (C-terminu...
      2017-07-05
      116
      1997-02-01
      1
      P71891.fasta
      P71891.xml
    
    
      Rv2322c
      P71890
      False
      rocD1
      mtv:RVBD_2322c
      WP_003411957.1
      PF00202
      Probable ornithine aminotransferase (N-terminu...
      2017-06-07
      117
      1997-02-01
      1
      P71890.fasta
      P71890.xml



In [10]:

    
# KEGG mapping of gene ids
my_gempro.kegg_mapping_and_metadata(kegg_organism_code='mtu')
print('Missing KEGG mapping: ', my_gempro.missing_kegg_mapping)
my_gempro.df_kegg_metadata.head()









    





 
 










    



[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv1755c: no sequence file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv1755c: no metadata file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv2233: no sequence file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv2233: no metadata file available
[2018-02-05 18:14] [ssbio.core.protein] WARNING: Rv2233: representative sequence does not match mapped KEGG sequence.
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv0619: no sequence file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv0619: no metadata file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv0618: no sequence file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv2321c: no sequence file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv2321c: no metadata file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv2322c: no sequence file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv2322c: no metadata file available






    









    



[2018-02-05 18:14] [ssbio.pipeline.gempro] INFO: 655/661: number of genes mapped to KEGG
[2018-02-05 18:14] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> KEGG. See the "df_kegg_metadata" attribute for a summary dataframe.






    



Missing KEGG mapping:  ['Rv0618', 'Rv2233', 'Rv0619', 'Rv1755c', 'Rv2322c', 'Rv2321c']






    Out[10]:







  
    
      
      kegg
      refseq
      uniprot
      pdbs
      sequence_file
      metadata_file
    
    
      gene
      
      
      
      
      
      
    
  
  
    
      Rv0013
      mtu:Rv0013
      YP_177615
      I6WX77
      NaN
      mtu-Rv0013.faa
      mtu-Rv0013.kegg
    
    
      Rv0032
      mtu:Rv0032
      NP_214546
      I6Y6Q7
      NaN
      mtu-Rv0032.faa
      mtu-Rv0032.kegg
    
    
      Rv0046c
      mtu:Rv0046c
      NP_214560
      I6X8D3
      1GR0
      mtu-Rv0046c.faa
      mtu-Rv0046c.kegg
    
    
      Rv0066c
      mtu:Rv0066c
      NP_214580
      L0T2B7
      NaN
      mtu-Rv0066c.faa
      mtu-Rv0066c.kegg
    
    
      Rv0069c
      mtu:Rv0069c
      NP_214583
      A0A089S0Y8
      NaN
      mtu-Rv0069c.faa
      mtu-Rv0069c.kegg



In [11]:

    
# UniProt mapping
my_gempro.uniprot_mapping_and_metadata(model_gene_source='TUBERCULIST_ID')
print('Missing UniProt mapping: ', my_gempro.missing_uniprot_mapping)
my_gempro.df_uniprot_metadata.head()









    



[2018-02-05 18:14] [root] INFO: getUserAgent: Begin
[2018-02-05 18:14] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 18:14] [root] INFO: getUserAgent: End
[2018-02-05 18:15] [root] WARNING: status is not ok with Bad Request
[2018-02-05 18:15] [root] WARNING: Results seems empty...returning empty dictionary.






    





 
 










    









    



[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: 0/661: number of genes mapped to UniProt
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.






    



Missing UniProt mapping:  ['Rv2178c', 'Rv2476c', 'Rv3607c', 'Rv1202', 'Rv1030', 'Rv1187', 'Rv2421c', 'Rv2247', 'Rv2701c', 'Rv1662', 'Rv1380', 'Rv2763c', 'Rv2435c', 'Rv0082', 'Rv0189c', 'Rv3757c', 'Rv1133c', 'Rv1257c', 'Rv1623c', 'Rv0958', 'Rv1162', 'Rv3581c', 'Rv1739c', 'Rv1086', 'Rv1077', 'Rv1625c', 'Rv3227', 'Rv2832c', 'Rv3308', 'Rv2858c', 'Rv2193', 'Rv2834c', 'Rv2995c', 'Rv1915', 'Rv3003c', 'Rv1538c', 'Rv2793c', 'Rv2152c', 'Rv3468c', 'Rv1485', 'Rv1079', 'Rv1552', 'Rv1161', 'Rv3307', 'Rv0768', 'Rv1843c', 'Rv1302', 'Rv0780', 'Rv2064', 'Rv2124c', 'Rv1908c', 'Rv1555', 'Rv2754c', 'Rv0346c', 'Rv2467', 'Rv2384', 'Rv3758c', 'Rv1338', 'Rv1311', 'Rv1599', 'Rv1589', 'Rv2930', 'Rv0564c', 'Rv2070c', 'Rv0143c', 'Rv0107c', 'Rv2157c', 'Rv2847c', 'Rv1416', 'Rv0147', 'Rv0432', 'Rv3152', 'Rv0098', 'Rv0548c', 'Rv1408', 'Rv2320c', 'Rv2859c', 'Rv3314c', 'Rv3331', 'Rv3393', 'Rv0509', 'Rv2062c', 'Rv1248c', 'Rv1895', 'Rv2870c', 'Rv1131', 'Rv1294', 'Rv0337c', 'Rv0650', 'Rv1001', 'Rv2211c', 'Rv2471', 'Rv2075c', 'Rv1622c', 'Rv3045', 'Rv0772', 'Rv1607', 'Rv2601', 'Rv2363', 'Rv2483c', 'Rv1553', 'Rv1559', 'Rv3464', 'Rv3010c', 'Rv2552c', 'Rv1031', 'Rv1905c', 'Rv3815c', 'Rv2935', 'Rv3469c', 'Rv1595', 'Rv1293', 'Rv1099c', 'Rv3398c', 'Rv2764c', 'Rv1604', 'Rv2399c', 'Rv1695', 'Rv0253', 'Rv0046c', 'Rv3634c', 'Rv2982c', 'Rv0414c', 'Rv3410c', 'Rv3042c', 'Rv1659', 'Rv0013', 'Rv3582c', 'Rv2051c', 'Rv1082', 'Rv1373', 'Rv1570', 'Rv2236c', 'Rv1093', 'Rv1445c', 'Rv3602c', 'Rv2502c', 'Rv1436', 'Rv1849', 'Rv2382c', 'Rv0952', 'Rv2139', 'Rv0956', 'Rv3283', 'Rv3310', 'Rv0373c', 'Rv2243', 'Rv1612', 'Rv1307', 'Rv3229c', 'Rv3265c', 'Rv2220', 'Rv2786c', 'Rv0382c', 'Rv2583c', 'Rv2678c', 'Rv0896', 'Rv1207', 'Rv0183', 'Rv3913', 'Rv2940c', 'Rv3153', 'Rv1731', 'Rv0973c', 'Rv3247c', 'Rv0728c', 'Rv2987c', 'Rv3846', 'Rv3150', 'Rv0234c', 'Rv3794', 'Rv3436c', 'Rv2427c', 'Rv3257c', 'Rv2539c', 'Rv2965c', 'Rv2156c', 'Rv3791', 'Rv1296', 'Rv2287', 'Rv2899c', 'Rv3275c', 'Rv1837c', 'Rv1285', 'Rv1602', 'Rv2245', 'Rv0423c', 'Rv2208', 'Rv3801c', 'Rv0436c', 'Rv1492', 'Rv2780', 'Rv1658', 'Rv2674', 'Rv2329c', 'Rv3509c', 'Rv3290c', 'Rv0620', 'Rv1383', 'Rv1850', 'Rv0112', 'Rv2996c', 'Rv3396c', 'Rv2671', 'Rv1653', 'Rv2496c', 'Rv3330', 'Rv1663', 'Rv0820', 'Rv1448c', 'Rv2205c', 'Rv1447c', 'Rv2043c', 'Rv3806c', 'Rv2029c', 'Rv3792', 'Rv3313c', 'Rv1822', 'Rv3470c', 'Rv1609', 'Rv0103c', 'Rv3264c', 'Rv1484', 'Rv3534c', 'Rv1603', 'Rv0803', 'Rv2849c', 'Rv2458', 'Rv3341', 'Rv1617', 'Rv0570', 'Rv3609c', 'Rv1512', 'Rv1672c', 'Rv3759c', 'Rv2201', 'Rv0855', 'Rv3236c', 'Rv1940', 'Rv2163c', 'Rv0374c', 'Rv1563c', 'Rv0375c', 'Rv2455c', 'Rv1310', 'Rv1295', 'Rv2702', 'Rv3356c', 'Rv2445c', 'Rv3273', 'Rv1306', 'Rv3423c', 'Rv2589', 'Rv0317c', 'Rv3303c', 'Rv2439c', 'Rv1304', 'Rv3248c', 'Rv0858c', 'Rv0534c', 'Rv3158', 'Rv1605', 'Rv1127c', 'Rv1692', 'Rv1655', 'Rv2192c', 'Rv1391', 'Rv1122', 'Rv2454c', 'Rv2573', 'Rv2949c', 'Rv2379c', 'Rv1240', 'Rv1098c', 'Rv2196', 'Rv0555', 'Rv1350', 'Rv3302c', 'Rv1309', 'Rv0267', 'Rv0255c', 'Rv2194', 'Rv3051c', 'Rv0524', 'Rv2590', 'Rv2465c', 'Rv0137c', 'Rv1601', 'Rv0437c', 'Rv1554', 'Rv1318c', 'Rv3340', 'Rv1389', 'Rv2934', 'Rv0545c', 'Rv0558', 'Rv1005c', 'Rv1412', 'Rv2127', 'Rv2335', 'Rv2392', 'Rv2981c', 'Rv3709c', 'Rv3754', 'Rv2612c', 'Rv2931', 'Rv0557', 'Rv0252', 'Rv2928', 'Rv1347c', 'Rv1449c', 'Rv1600', 'Rv2207', 'Rv0536', 'Rv0295c', 'Rv3455c', 'Rv3411c', 'Rv0503c', 'Rv3708c', 'Rv0417', 'Rv1704c', 'Rv0489', 'Rv3795', 'Rv0553', 'Rv0573c', 'Rv2498c', 'Rv0470c', 'Rv0773c', 'Rv3048c', 'Rv2933', 'Rv3704c', 'Rv3818', 'Rv1737c', 'Rv0266c', 'Rv0644c', 'Rv2130c', 'Rv3280', 'Rv2400c', 'Rv2182c', 'Rv1018c', 'Rv2158c', 'Rv3149', 'Rv2289', 'Rv0126', 'Rv2967c', 'Rv3214', 'Rv2317', 'Rv3490', 'Rv2881c', 'Rv1916', 'Rv1438', 'Rv1348', 'Rv3808c', 'Rv0334', 'Rv1594', 'Rv3156', 'Rv2072c', 'Rv2344c', 'Rv2210c', 'Rv2281', 'Rv0993', 'Rv2316', 'Rv1406', 'Rv1613', 'Rv1201c', 'Rv0091', 'Rv3318', 'Rv3285', 'Rv0409', 'Rv0974c', 'Rv3316', 'Rv2121c', 'Rv2833c', 'Rv3309c', 'Rv1328', 'Rv3266c', 'Rv1213', 'Rv2697c', 'Rv0946c', 'Rv1832', 'Rv3793', 'Rv1011', 'Rv3696c', 'Rv2246', 'Rv3858c', 'Rv1620c', 'Rv2386c', 'Rv1618', 'Rv0211', 'Rv1631', 'Rv0512', 'Rv1385', 'Rv1286', 'Rv2122c', 'Rv3319', 'Rv2249c', 'Rv2835c', 'Rv2941', 'Rv2605c', 'Rv2394', 'Rv3145', 'Rv0886', 'Rv3157', 'Rv3737', 'Rv1094', 'Rv1820', 'Rv0500', 'Rv1236', 'Rv1409', 'Rv2136c', 'Rv3800c', 'Rv2984', 'Rv0118c', 'Rv1121', 'Rv2932', 'Rv1826', 'Rv2233', 'Rv1511', 'Rv2540c', 'Rv3838c', 'Rv1305', 'Rv2200c', 'Rv3146', 'Rv0389', 'Rv0848', 'Rv3279c', 'Rv1529', 'Rv0533c', 'Rv0951', 'Rv0884c', 'Rv1308', 'Rv0482', 'Rv2259', 'Rv0727c', 'Rv0261c', 'Rv3281', 'Rv3756c', 'Rv1475c', 'Rv0462', 'Rv1551', 'Rv0805', 'Rv1656', 'Rv2398c', 'Rv1392', 'Rv1170', 'Rv2438c', 'Rv3379c', 'Rv1562c', 'Rv0066c', 'Rv2726c', 'Rv1844c', 'Rv1699', 'Rv2231c', 'Rv3002c', 'Rv3441c', 'Rv1029', 'Rv3001c', 'Rv1415', 'Rv3148', 'Rv2378c', 'Rv2958c', 'Rv2900c', 'Rv0505c', 'Rv3043c', 'Rv1200', 'Rv1185c', 'Rv2531c', 'Rv3772', 'Rv2947c', 'Rv1654', 'Rv0486', 'Rv2436', 'Rv2977c', 'Rv0511', 'Rv1848', 'Rv1237', 'Rv3601c', 'Rv0733', 'Rv3155', 'Rv0753c', 'Rv0694', 'Rv3215', 'Rv3710', 'Rv3606c', 'Rv0254c', 'Rv3315c', 'Rv3713', 'Rv0155', 'Rv0859', 'Rv3317', 'Rv2222c', 'Rv2495c', 'Rv1745c', 'Rv2497c', 'Rv0771', 'Rv1320c', 'Rv3842c', 'Rv1349', 'Rv2773c', 'Rv1164', 'Rv1381', 'Rv0363c', 'Rv2682c', 'Rv3113', 'Rv3154', 'Rv2860c', 'Rv3588c', 'Rv3608c', 'Rv3535c', 'Rv2318', 'Rv0777', 'Rv0729', 'Rv1714', 'Rv0645c', 'Rv0853c', 'Rv0642c', 'Rv0162c', 'Rv2992c', 'Rv3068c', 'Rv1621c', 'Rv0522', 'Rv1568', 'Rv3465', 'Rv2584c', 'Rv2746c', 'Rv0248c', 'Rv0542c', 'Rv2383c', 'Rv0391', 'Rv2332', 'Rv1264', 'Rv2065', 'Rv1652', 'Rv2291', 'Rv1902c', 'Rv1928c', 'Rv0478', 'Rv0467', 'Rv2713', 'Rv1017c', 'Rv0501', 'Rv0422c', 'Rv1023', 'Rv3624c', 'Rv2071c', 'Rv2964', 'Rv0357c', 'Rv2195', 'Rv1647', 'Rv3293', 'Rv2006', 'Rv3826', 'Rv0499', 'Rv0032', 'Rv2988c', 'Rv2155c', 'Rv3372', 'Rv1712', 'Rv1336', 'Rv0306', 'Rv2397c', 'Rv1611', 'Rv3859c', 'Rv0889c', 'Rv2677c', 'Rv0069c', 'Rv2945c', 'Rv1238', 'Rv0408', 'Rv3276c', 'Rv2361c', 'Rv0510', 'Rv1569', 'Rv2610c', 'Rv0812', 'Rv2962c', 'Rv0649', 'Rv2381c', 'Rv2538c', 'Rv1872c', 'Rv3628', 'Rv0191', 'Rv2920c', 'Rv3777', 'Rv0808', 'Rv2447c', 'Rv2607', 'Rv2611c', 'Rv3790', 'Rv3147', 'Rv2380c', 'Rv2753c', 'Rv1323', 'Rv3255c', 'Rv2202c', 'Rv2443', 'Rv0247c', 'Rv1437', 'Rv3784', 'Rv0156', 'Rv1188', 'Rv0322', 'Rv2957', 'Rv0957', 'Rv3332', 'Rv0157', 'Rv3339c', 'Rv3809c', 'Rv2523c', 'Rv0468', 'Rv0936', 'Rv2501c', 'Rv3106', 'Rv0904c', 'Rv1878', 'Rv1451', 'Rv0794c', 'Rv1885c', 'Rv2524c', 'Rv2153c', 'Rv1606', 'Rv3432c', 'Rv0843', 'Rv0321', 'Rv2883c', 'Rv2334', 'Rv1315', 'Rv0819', 'Rv3667', 'Rv0809', 'Rv1596', 'Rv0084', 'Rv2066', 'Rv2391', 'Rv1319c', 'Rv1163', 'Rv2504c', 'Rv1239c', 'Rv3565', 'Rv2215', 'Rv2855', 'Rv0824c', 'Rv2537c', 'Rv1464', 'Rv2848c', 'Rv2225', 'Rv0070c', 'Rv2241', 'Rv1493', 'Rv1981c', 'Rv2503c', 'Rv2388c', 'Rv1092c', 'Rv1483', 'Rv0788', 'Rv0860']






    Out[11]:







  
    
      
      uniprot
      reviewed
      gene_name
      kegg
      refseq
      pfam
      description
      entry_date
      entry_version
      seq_date
      seq_version
      sequence_file
      metadata_file
    
    
      gene
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Rv0618
      Q79FY4
      False
      galTa
      mtv:RVBD_0618
      WP_003900189.1
      PF01087
      Probable galactose-1-phosphate uridylyltransfe...
      2017-07-05
      87
      2004-07-05
      1
      Q79FY4.fasta
      Q79FY4.xml
    
    
      Rv0619
      Q79FY3
      False
      galTb
      NaN
      NaN
      PF02744
      Probable galactose-1-phosphate uridylyltransfe...
      2017-07-05
      78
      2004-07-05
      1
      Q79FY3.fasta
      Q79FY3.xml
    
    
      Rv1755c
      P9WIA9
      False
      plcD
      NaN
      NaN
      PF04185
      Phospholipase C 4
      2017-07-05
      18
      2014-04-16
      1
      P9WIA9.fasta
      P9WIA9.xml
    
    
      Rv2321c
      P71891
      False
      rocD2
      mtv:RVBD_2321c
      WP_003411956.1
      PF00202
      Probable ornithine aminotransferase (C-terminu...
      2017-07-05
      116
      1997-02-01
      1
      P71891.fasta
      P71891.xml
    
    
      Rv2322c
      P71890
      False
      rocD1
      mtv:RVBD_2322c
      WP_003411957.1
      PF00202
      Probable ornithine aminotransferase (N-terminu...
      2017-06-07
      117
      1997-02-01
      1
      P71890.fasta
      P71890.xml

If you have mapped with both KEGG and UniProt mappers, then you can set a representative sequence for the gene using this function. If you used just one, this will just set that ID as representative.

If any sequences or IDs were provided manually, these will be set as representative first.
UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn't.



In [12]:

    
# Set representative sequences
my_gempro.set_representative_sequence()
print('Missing a representative sequence: ', my_gempro.missing_representative_sequence)
my_gempro.df_representative_sequences.head()









    





 
 










    









    



[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: 661/661: number of genes with a representative sequence
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: See the "df_representative_sequences" attribute for a summary dataframe.






    



Missing a representative sequence:  []






    Out[12]:







  
    
      
      uniprot
      kegg
      pdbs
      sequence_file
      metadata_file
    
    
      gene
      
      
      
      
      
    
  
  
    
      Rv0013
      I6WX77
      mtu:Rv0013
      NaN
      mtu-Rv0013.faa
      mtu-Rv0013.kegg
    
    
      Rv0032
      I6Y6Q7
      mtu:Rv0032
      NaN
      mtu-Rv0032.faa
      mtu-Rv0032.kegg
    
    
      Rv0046c
      I6X8D3
      mtu:Rv0046c
      1GR0
      mtu-Rv0046c.faa
      mtu-Rv0046c.kegg
    
    
      Rv0066c
      L0T2B7
      mtu:Rv0066c
      NaN
      mtu-Rv0066c.faa
      mtu-Rv0066c.kegg
    
    
      Rv0069c
      A0A089S0Y8
      mtu:Rv0069c
      NaN
      mtu-Rv0069c.faa
      mtu-Rv0069c.kegg

Mapping representative sequence --> structure

These are the ways to map sequence to structure:

Use the UniProt ID and their automatic mappings to the PDB
BLAST the sequence to the PDB
Make homology models or
Map to existing homology models

You can only utilize option #1 to map to PDBs if there is a mapped UniProt ID set in the representative sequence. If not, you'll have to BLAST your sequence to the PDB or make a homology model. You can also run both for maximum coverage.

Methods



In [13]:

    
# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=.3)
my_gempro.df_pdb_ranking.head()









    



[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: Mapping UniProt IDs --> PDB IDs...
[2018-02-05 18:15] [root] INFO: getUserAgent: Begin
[2018-02-05 18:15] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 18:15] [root] INFO: getUserAgent: End
[2018-02-05 18:15] [root] WARNING: status is not ok with Bad Request
[2018-02-05 18:15] [root] WARNING: Results seems empty...returning empty dictionary.






    





 
 










    









    



[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: 0/661: number of genes with at least one experimental structure
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: Completed UniProt --> best PDB mapping. See the "df_pdb_ranking" attribute for a summary dataframe.
[2018-02-05 18:15] [ssbio.pipeline.gempro] WARNING: Empty dataframe






    Out[13]:



In [14]:

    
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.9, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)









    





 
 










    









    



[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: 141: number of genes with additional structures added from BLAST






    Out[14]:







  
    
      
      pdb_id
      pdb_chain_id
      hit_score
      hit_evalue
      hit_percent_similar
      hit_percent_ident
      hit_num_ident
      hit_num_similar
    
    
      gene
      
      
      
      
      
      
      
      
    
  
  
    
      Rv0046c
      1gr0
      A
      1861.0
      0.0
      1.000000
      1.000000
      367
      367
    
    
      Rv0066c
      5kvu
      D
      3828.0
      0.0
      0.981208
      0.981208
      731
      731



In [15]:

    
tb_homology_dir = '/home/nathan/projects_archive/homology_models/MTUBERCULOSIS/'

##### EXAMPLE SPECIFIC CODE #####
# Needed to map to older IDs used in this example
import pandas as pd
import os.path as op
old_gene_to_homology = pd.read_csv(op.join(tb_homology_dir, 'data/161031-old_gene_to_uniprot_mapping.csv'))
gene_to_uniprot = old_gene_to_homology.set_index('m_gene').to_dict()['u_uniprot_acc']
my_gempro.get_itasser_models(homology_raw_dir=op.join(tb_homology_dir, 'raw'), custom_itasser_name_mapping=gene_to_uniprot)
### END EXAMPLE SPECIFIC CODE ###

# Organizing I-TASSER homology models
my_gempro.get_itasser_models(homology_raw_dir=op.join(tb_homology_dir, 'raw'))
my_gempro.df_homology_models.head()









    





 
 










    









    



[2018-02-05 18:16] [ssbio.pipeline.gempro] INFO: Completed copying of 435 I-TASSER models to GEM-PRO directory. See the "df_homology_models" attribute for a summary dataframe.






    





 
 










    









    



[2018-02-05 18:16] [ssbio.pipeline.gempro] INFO: Completed copying of 9 I-TASSER models to GEM-PRO directory. See the "df_homology_models" attribute for a summary dataframe.






    Out[15]:







  
    
      
      id
      structure_file
      model_date
      difficulty
      top_template_pdb
      top_template_chain
      c_score
      tm_score
      tm_score_err
      rmsd
      rmsd_err
    
    
      gene
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Rv0013
      P9WN35
      P9WN35_model1.pdb
      2018-02-06
      easy
      1i7s
      B
      -0.53
      0.65
      0.13
      6.8
      4.0
    
    
      Rv0032
      P9WQ85
      P9WQ85_model1.pdb
      2018-02-06
      easy
      3a2b
      A
      -2.89
      0.39
      0.13
      15.7
      3.3
    
    
      Rv0066c
      O53611
      O53611_model1.pdb
      2018-02-06
      easy
      1itw
      A
      1.91
      0.99
      0.04
      4.1
      2.8
    
    
      Rv0069c
      P9WGT5
      P9WGT5_model1.pdb
      2018-02-06
      easy
      4rqo
      A
      1.18
      0.88
      0.07
      4.6
      3.0
    
    
      Rv0070c
      P9WGI7
      P9WGI7_model1.pdb
      2018-02-06
      easy
      3h7f
      B
      1.80
      0.97
      0.05
      3.3
      2.3



In [16]:

    
homology_model_dict = {}
my_gempro.get_manual_homology_models(homology_model_dict)









    





 
 










    









    



[2018-02-05 18:16] [ssbio.pipeline.gempro] INFO: Updated homology model information for 0 genes.

Downloading and ranking structures

Methods

**Warning:** Downloading all PDBs takes a while, since they are also parsed for metadata. You can skip this step and just set representative structures below if you want to minimize the number of PDBs downloaded.



In [ ]:

    
# Download all mapped PDBs and gather the metadata
my_gempro.download_all_pdbs()
my_gempro.df_pdb_metadata.head(2)



In [17]:

    
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()









    





 
 










    



[2018-02-05 18:16] [ssbio.core.protein] WARNING: Rv0234c: no structures meet quality checks
[2018-02-05 18:16] [ssbio.core.protein] WARNING: Rv0505c: no structures meet quality checks
[2018-02-05 18:17] [ssbio.core.protein] WARNING: Rv2987c: no structures meet quality checks
[2018-02-05 18:18] [ssbio.core.protein] WARNING: Rv2498c: no structures meet quality checks
[2018-02-05 18:18] [ssbio.core.protein] WARNING: Rv3601c: no structures meet quality checks






    









    



[2018-02-05 18:18] [ssbio.pipeline.gempro] INFO: 553/661: number of genes with a representative structure
[2018-02-05 18:18] [ssbio.pipeline.gempro] INFO: See the "df_representative_structures" attribute for a summary dataframe.






    Out[17]:







  
    
      
      id
      is_experimental
      file_type
      structure_file
    
    
      gene
      
      
      
      
    
  
  
    
      Rv0013
      REP-P9WN35
      False
      pdb
      P9WN35_model1-X_clean.pdb
    
    
      Rv0032
      REP-P9WQ85
      False
      pdb
      P9WQ85_model1-X_clean.pdb
    
    
      Rv0046c
      REP-1gr0
      True
      pdb
      1gr0-A_clean.pdb
    
    
      Rv0066c
      REP-5kvu
      True
      pdb
      5kvu-A_clean.pdb
    
    
      Rv0069c
      REP-P9WGT5
      False
      pdb
      P9WGT5_model1-X_clean.pdb



In [18]:

    
# Looking at the information saved within a gene
my_gempro.genes.get_by_id('Rv1295').protein.representative_structure
my_gempro.genes.get_by_id('Rv1295').protein.representative_structure.get_dict()









    Out[18]:





<StructProp REP-2d1f at 0x7fdec937e4a8>






    Out[18]:





{'_structure_dir': '/tmp/mtuberculosis_gp/genes/Rv1295/Rv1295_protein/structures',
 'chains': [<ChainProp A at 0x7fdec8655630>],
 'date': None,
 'description': 'Threonine synthase (E.C.4.2.3.1)',
 'file_type': 'pdb',
 'id': 'REP-2d1f',
 'is_experimental': True,
 'mapped_chains': ['A'],
 'notes': {},
 'original_structure_id': '2d1f',
 'resolution': 2.5,
 'structure_file': '2d1f-A_clean.pdb',
 'taxonomy_name': 'Mycobacterium tuberculosis'}

Creating homology models

For those proteins with no representative structure, we can create homology models for them. ssbio contains some built in functions for easily running I-TASSER locally or on machines with SLURM (ie. on NERSC) or Torque job scheduling.

You can load in I-TASSER models once they complete using the get_itasser_models later.

**Info:** Homology modeling can take a long time - about 24-72 hours per protein (highly dependent on the sequence length, as well as if there are available templates).

Methods



In [19]:

    
# Prep I-TASSER model folders
my_gempro.prep_itasser_modeling('~/software/I-TASSER4.4', '~/software/ITLIB/', runtype='local', all_genes=False)









    



[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2934: I-TASSER modeling will not run as sequence length (1827) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2932: I-TASSER modeling will not run as sequence length (1538) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2933: I-TASSER modeling will not run as sequence length (2188) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2931: I-TASSER modeling will not run as sequence length (1876) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2380c: I-TASSER modeling will not run as sequence length (1682) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv3859c: I-TASSER modeling will not run as sequence length (1527) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2476c: I-TASSER modeling will not run as sequence length (1624) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv3800c: I-TASSER modeling will not run as sequence length (1733) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv0107c: I-TASSER modeling will not run as sequence length (1632) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2940c: I-TASSER modeling will not run as sequence length (2111) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv1662: I-TASSER modeling will not run as sequence length (1602) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2524c: I-TASSER modeling will not run as sequence length (3069) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.pipeline.gempro] INFO: Prepared I-TASSER modeling folders for 108 genes in folder /tmp/mtuberculosis_gp/data/homology_models

Saving your GEM-PRO

Finally, you can save your GEM-PRO as a JSON or pickle file, so you don't have to run the pipeline again.

For most functions, if you rerun them, they will check for existing results saved as files. The only function that would take a long time is setting the representative structure, as they are each rechecked and cleaned. This is where saving helps!

**Warning:** Saving in JSON format is still experimental. For a full GEM-PRO with sequences & structures, depending on the number of genes, saving can take >5 minutes.



In [20]:

    
import os.path as op
my_gempro.save_pickle(op.join(my_gempro.model_dir, '{}.pckl'.format(my_gempro.id)))



In [21]:

    
import os.path as op
my_gempro.save_json(op.join(my_gempro.model_dir, '{}.json'.format(my_gempro.id)), compression=False)









    



[2018-02-05 18:18] [root] WARNING: json-tricks: numpy scalar serialization is experimental and may work differently in future versions
[2018-02-05 18:18] [ssbio.io] INFO: Saved <class 'ssbio.pipeline.gempro.GEMPRO'> (id: mtuberculosis_gp) to /tmp/mtuberculosis_gp/model/mtuberculosis_gp.json

Loading a saved GEM-PRO



In [ ]:

    
# Loading a pickle file
import pickle
with open('/tmp/mtuberculosis_gp_atlas/model/mtuberculosis_gp_atlas.pckl', 'rb') as f:
    my_saved_gempro = pickle.load(f)



In [ ]:

    
# Loading a JSON file
import ssbio.core.io
my_saved_gempro = ssbio.core.io.load_json('/tmp/mtuberculosis_gp_atlas/model/mtuberculosis_gp_atlas.json', decompression=False)

	uniprot	reviewed	gene_name	kegg	refseq	pfam	description	entry_date	entry_version	seq_date	seq_version	sequence_file	metadata_file
gene
Rv0619	Q79FY3	False	galTb	NaN	NaN	PF02744	Probable galactose-1-phosphate uridylyltransfe...	2017-07-05	78	2004-07-05	1	Q79FY3.fasta	Q79FY3.xml
Rv1755c	P9WIA9	False	plcD	NaN	NaN	PF04185	Phospholipase C 4	2017-07-05	18	2014-04-16	1	P9WIA9.fasta	P9WIA9.xml
Rv2321c	P71891	False	rocD2	mtv:RVBD_2321c	WP_003411956.1	PF00202	Probable ornithine aminotransferase (C-terminu...	2017-07-05	116	1997-02-01	1	P71891.fasta	P71891.xml
Rv2322c	P71890	False	rocD1	mtv:RVBD_2322c	WP_003411957.1	PF00202	Probable ornithine aminotransferase (N-terminu...	2017-06-07	117	1997-02-01	1	P71890.fasta	P71890.xml

	kegg	refseq	uniprot	pdbs	sequence_file	metadata_file
gene
Rv0013	mtu:Rv0013	YP_177615	I6WX77	NaN	mtu-Rv0013.faa	mtu-Rv0013.kegg
Rv0032	mtu:Rv0032	NP_214546	I6Y6Q7	NaN	mtu-Rv0032.faa	mtu-Rv0032.kegg
Rv0046c	mtu:Rv0046c	NP_214560	I6X8D3	1GR0	mtu-Rv0046c.faa	mtu-Rv0046c.kegg
Rv0066c	mtu:Rv0066c	NP_214580	L0T2B7	NaN	mtu-Rv0066c.faa	mtu-Rv0066c.kegg
Rv0069c	mtu:Rv0069c	NP_214583	A0A089S0Y8	NaN	mtu-Rv0069c.faa	mtu-Rv0069c.kegg

	uniprot	reviewed	gene_name	kegg	refseq	pfam	description	entry_date	entry_version	seq_date	seq_version	sequence_file	metadata_file
gene
Rv0618	Q79FY4	False	galTa	mtv:RVBD_0618	WP_003900189.1	PF01087	Probable galactose-1-phosphate uridylyltransfe...	2017-07-05	87	2004-07-05	1	Q79FY4.fasta	Q79FY4.xml
Rv0619	Q79FY3	False	galTb	NaN	NaN	PF02744	Probable galactose-1-phosphate uridylyltransfe...	2017-07-05	78	2004-07-05	1	Q79FY3.fasta	Q79FY3.xml
Rv1755c	P9WIA9	False	plcD	NaN	NaN	PF04185	Phospholipase C 4	2017-07-05	18	2014-04-16	1	P9WIA9.fasta	P9WIA9.xml
Rv2321c	P71891	False	rocD2	mtv:RVBD_2321c	WP_003411956.1	PF00202	Probable ornithine aminotransferase (C-terminu...	2017-07-05	116	1997-02-01	1	P71891.fasta	P71891.xml
Rv2322c	P71890	False	rocD1	mtv:RVBD_2322c	WP_003411957.1	PF00202	Probable ornithine aminotransferase (N-terminu...	2017-06-07	117	1997-02-01	1	P71890.fasta	P71890.xml

	id	structure_file	model_date	difficulty	top_template_pdb	top_template_chain	c_score	tm_score	tm_score_err	rmsd	rmsd_err
gene
Rv0013	P9WN35	P9WN35_model1.pdb	2018-02-06	easy	1i7s	B	-0.53	0.65	0.13	6.8	4.0
Rv0032	P9WQ85	P9WQ85_model1.pdb	2018-02-06	easy	3a2b	A	-2.89	0.39	0.13	15.7	3.3
Rv0066c	O53611	O53611_model1.pdb	2018-02-06	easy	1itw	A	1.91	0.99	0.04	4.1	2.8
Rv0069c	P9WGT5	P9WGT5_model1.pdb	2018-02-06	easy	4rqo	A	1.18	0.88	0.07	4.6	3.0
Rv0070c	P9WGI7	P9WGI7_model1.pdb	2018-02-06	easy	3h7f	B	1.80	0.97	0.05	3.3	2.3

	id	is_experimental	file_type	structure_file
gene
Rv0013	REP-P9WN35	False	pdb	P9WN35_model1-X_clean.pdb
Rv0032	REP-P9WQ85	False	pdb	P9WQ85_model1-X_clean.pdb
Rv0046c	REP-1gr0	True	pdb	1gr0-A_clean.pdb
Rv0066c	REP-5kvu	True	pdb	5kvu-A_clean.pdb
Rv0069c	REP-P9WGT5	False	pdb	P9WGT5_model1-X_clean.pdb